Improving the performance of word frequencies in authorship attribution

Maciej Eder (maciej.eder@ijp.pan.pl)

6.07.2022


JADT2022 conference
Napoli, 6–8.07.2022

introduction

stylometry

  • measures stylistic differences between texts
  • oftentimes aimed at authorship attribution
  • relies on stylistic fingerprint, …
  • … aka measurable linguistic features
    • frequencies of function words
    • frequencies of grammatical patterns, etc.
  • proves successful in several applications

areas of improvement

  • classification method
    • distance-based
    • SVM, NSC, k-NN, …
    • neural networks
  • feature engineering
    • dimension reduction
    • lasso
  • feature choice
    • MFWs
    • POS n-grams
    • character n-grams

relative frequencies

simple normalization

Occurrences of the most frequent words (MFWs):

## 
##  the  and   to    i   of    a   in  was  her   it  you   he  she that  not   my 
## 4571 4748 3536 4130 2224 2326 1484 1127 1551 1391 1895 2138 1338 1250  937 1106

Relative frequencies:

## 
##    the    and     to      i     of      a     in    was    her     it    you 
## 0.0383 0.0398 0.0296 0.0346 0.0186 0.0195 0.0124 0.0094 0.0130 0.0116 0.0159

relative frequencies

The number of occurrences of a given word divided by the total number of words:

\[ f_\mathrm{the} = \frac{n_\mathrm{the}}{ n_\mathrm{the} + n_\mathrm{of} + n_\mathrm{and} + n_\mathrm{in} + \dots } \]

In a generalized version:

\[ f_{w} = \frac{n_{w}}{N} \]
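
A minimal sketch of this normalization in Python. The whitespace tokenizer and the toy sentence are illustrative assumptions, not the corpus or tooling behind the numbers shown earlier:

```python
from collections import Counter


def relative_frequencies(tokens):
    """Return f_w = n_w / N for every word type in `tokens`."""
    counts = Counter(tokens)
    total = sum(counts.values())  # N: total number of tokens
    return {word: n / total for word, n in counts.items()}


# Toy input; a real study would tokenize a whole text.
tokens = "the cat and the dog and the bird".split()
freqs = relative_frequencies(tokens)
print(freqs["the"])  # 3 of the 8 tokens are "the"
```

Because every count is divided by the same total, the resulting values sum to 1 across the whole vocabulary.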

relative frequencies

  • routinely used
  • reliable
  • simple
  • intuitive
  • conceptually elegant

words that matter

synonyms

Proportions within synonym groups might betray a stylistic signal:

  • on and upon
  • drink and beverage
  • buy and purchase
  • big and large
  • et and atque and ac

\[ f_\mathrm{on} = \frac{n_\mathrm{on}}{ n_\mathrm{on} + n_\mathrm{upon} } \]

\[ f_\mathrm{upon} = \frac{n_\mathrm{upon}}{ n_\mathrm{on} + n_\mathrm{upon} } \]
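
The two formulas above can be sketched as follows. The function name `group_proportions` and the toy sentence are hypothetical, chosen only for illustration:

```python
from collections import Counter


def group_proportions(tokens, group):
    """Share of each word in `group` among all occurrences of the group."""
    counts = Counter(t for t in tokens if t in group)
    group_total = sum(counts.values())
    if group_total == 0:
        # None of the group's words occur: report zeros rather than divide by 0.
        return {word: 0.0 for word in group}
    return {word: counts[word] / group_total for word in group}


# Toy input with two occurrences of "on" and one of "upon".
tokens = "he sat on the chair and put the book upon the shelf on the wall".split()
props = group_proportions(tokens, ("on", "upon"))
print(props)
```

Here the denominator is the size of the synonym group rather than the length of the text, so the proportions within one group always sum to 1.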

limitations of synonyms

‘on’/total vs. ‘on’/(‘upon’ + ‘on’)

‘the’/total vs. ‘the’/(‘of’ + ‘the’)
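
As a worked illustration, take the counts of the most frequent words listed earlier, where ‘the’ occurs 4571 times and ‘of’ 2224 times, and the relative frequency of ‘the’ is reported as 0.0383. Pairing ‘the’ with ‘of’ instead of the whole text gives a very different value:

\[ \frac{n_\mathrm{the}}{ n_\mathrm{of} + n_\mathrm{the} } = \frac{4571}{2224 + 4571} = \frac{4571}{6795} \approx 0.673 \]

The arithmetic works for any pair of words. Whether such a denominator is meaningful is a separate question, since ‘the’ and ‘of’ are not synonyms in the way ‘on’ and ‘upon’ are.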

beyond synonyms

semantic similarity

word vector models
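
One way to read this step is that a word vector model can supply the comparison group for words that have no obvious synonyms. The sketch below is an assumption about how that could look, not the procedure used in this study. The 2-dimensional vectors, the counts, and the function names are all invented for illustration:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, vectors, k):
    """The k words whose vectors are most similar to that of `word`."""
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return others[:k]


def frequency_among_neighbours(word, counts, vectors, k):
    """Count of `word` divided by the counts of `word` plus its k neighbours."""
    group = [word] + nearest_neighbours(word, vectors, k)
    group_total = sum(counts.get(w, 0) for w in group)
    return counts.get(word, 0) / group_total if group_total else 0.0


# Hypothetical vectors and counts, invented for illustration only.
vectors = {
    "on": (1.0, 0.1),
    "upon": (0.9, 0.2),
    "the": (0.1, 1.0),
    "of": (0.2, 0.9),
}
counts = {"on": 30, "upon": 10, "the": 400, "of": 200}

print(nearest_neighbours("on", vectors, 1))
print(frequency_among_neighbours("on", counts, vectors, 1))
```

With `k = 1` and these toy vectors, the group for ‘on’ is ‘on’ plus ‘upon’, so the result matches the synonym-based proportion. A larger `k` widens the denominator towards the whole vocabulary.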

results for Cosine Delta

results for Burrows’s Delta

results for Eder’s Delta

results for Manhattan Distance

gain for Cosine Delta

gain for Burrows’s Delta

gain for Eder’s Delta

gain for Manhattan Distance
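
Two of the measures named above have compact textbook definitions: Burrows's Delta is the mean absolute difference between z-scored word frequencies, and Cosine Delta is the cosine distance between the same z-score vectors. The sketch below assumes the sample standard deviation for the z-scores and uses an invented three-text frequency table. It does not reproduce Eder's Delta or the experiments behind the results.

```python
import math
import statistics


def z_scores(freq_table):
    """Standardize each column (word) of a texts-by-words frequency table."""
    columns = list(zip(*freq_table))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # sample standard deviation
    return [
        [(x - m) / s for x, m, s in zip(row, means, sds)]
        for row in freq_table
    ]


def burrows_delta(za, zb):
    """Mean absolute difference between two z-score vectors."""
    return sum(abs(a - b) for a, b in zip(za, zb)) / len(za)


def cosine_delta(za, zb):
    """Cosine distance (1 - cosine similarity) between two z-score vectors."""
    dot = sum(a * b for a, b in zip(za, zb))
    norm_a = math.sqrt(sum(a * a for a in za))
    norm_b = math.sqrt(sum(b * b for b in zb))
    return 1.0 - dot / (norm_a * norm_b)


# Invented relative frequencies: rows are texts, columns are words.
freq_table = [
    [0.040, 0.020, 0.010],
    [0.038, 0.022, 0.011],
    [0.030, 0.030, 0.015],
]
z = z_scores(freq_table)
print(burrows_delta(z[0], z[1]), burrows_delta(z[0], z[2]))
print(cosine_delta(z[0], z[1]), cosine_delta(z[0], z[2]))
```

In this toy table the first two texts have similar profiles, so both measures place them closer to each other than to the third text.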

the template

This presentation uses a template built for the CLS INFRA project on top of the reveal.js framework. The template is meant to be used within the R programming environment, with the rmarkdown package loaded. It is based on Ingo Börner’s previous work, which instead targets a low-level installation of reveal.js.

conclusion

work in progress

  • the template works just fine in most contexts…
  • … but it still needs some tweaks
  • e.g. the next slide still doesn’t hide the project’s logo in the corner
  • therefore, please expect updates in the future